ABSTRACT

In this paper we describe an approach to automatic evalua- 
tion of both the speech recognition and understanding capa- 
bilities of a spoken dialogue system for train time table infor- 
mation. We use word accuracy for recognition and concept 
accuracy for understanding performance judgement. Both 
measures are calculated by comparing these modules' out- 
put with a correct reference answer. We report evaluation 
results for a spontaneous speech corpus with about 10000 ut- 
terances. We observed a nearly linear relationship between 
word accuracy and concept accuracy. 

1. INTRODUCTION 

Total system evaluation plays an important role for developers
of spoken dialogue systems, because it allows them both to
monitor progress within a single project and to compare
different solutions for the same task. An objective and
verifiable judgement of system performance requires that the
scientific community agree on widely accepted evaluation
measures. In speech recognition, such a mutually agreed upon
measure is available in the form of word accuracy (WA).
Standardized tools exist which can automatically compute the
WA of recognition results for a given test corpus annotated
with transcriptions of the actually spoken words. This high
standard of automatic evaluation has not yet been transferred
to the higher processing level of speech understanding,
although the last few years have witnessed increasing efforts
in the development of an evaluation methodology for natural
language processing.

This paper describes our approach to automatic evaluation of
both the recognition and the understanding capabilities of a
spoken dialogue system for train time table inquiries [3].
Such an integrated evaluation environment allows a systematic
investigation of the relationship between recognition and
understanding performance. The central question is: How does
a change in the recognition accuracy affect the understanding
accuracy? First we describe the evaluation measures word
accuracy and concept accuracy. We then present our evaluation
architecture for the automatic calculation of recognition and
understanding accuracy. Finally, we report results for a
spontaneous speech corpus containing about 10000 utterances.

2. EVALUATION MEASURES 

Automatic evaluation methods require the use of prepared test
corpora in which each test case is combined with a "correct"
reference answer against which the system output can be
judged. In speech recognition, it is relatively
uncontroversial what these reference answers look like: they
are transcriptions of the words that were actually spoken.¹
It is less clear, however, what constitutes the "correct"
analysis at the level of language understanding. Currently,
there is no agreement among computational linguists regarding
a "correct" semantic representation for a wide variety of
linguistic phenomena. As a consequence, there are no
semantically annotated corpora available as a common test bed
for comparative evaluation of linguistic processing
components. Nevertheless, we believe that an objective and
verifiable measurement of the understanding capabilities of a
system can only be achieved with a "reference answer"-based
approach using test corpora with semantic annotations. This
conviction is based on the fact that the main task of the
linguistic processing component in a spoken dialogue system
is to map the spoken input to a semantic representation.
Evaluation approaches which look only at the surface forms²
or the syntactic structures of the parsing results cannot
judge the parser performance regarding the construction of a
semantic representation. Therefore, we defined a semantic
annotation format within our task domain. For measuring the
understanding performance we adopted the so-called concept
accuracy. This measure, which was proposed by the evaluation
working group of the ESPRIT project SUNDIAL, can be
calculated automatically in analogy with the recognition
measure word accuracy.



1 There are still debates on the transcription and evaluation
of spontaneous speech containing fragmentary words,
hesitations, background noise, etc.

2 One approach of this kind rates a word graph parser by its
sentence recognition accuracy, which is defined as "the number
of word graphs where the analysis found the spoken sentences
divided by the number of word graphs".



2.1. Word Accuracy 

Word Accuracy (WA) is a widely accepted evaluation mea- 
sure for word recognizers. The automatic calculation of WA 
for a given set of recognition results requires the existence of 
reference transliterations for all spoken utterances. The ref- 
erence answers consist of a transcription of what was actually 
spoken. Given the reference REF, the WA of the recognizer 
output HYP is determined by calculating the Levenshtein 
distance between REF and HYP and by assigning equal costs 
to substitution, insertion, and deletion errors. WA is calcu- 
lated as a percentage using the formula 



\[
  \mathrm{WA} = 100 \left( 1 - \frac{W_s + W_i + W_d}{W} \right) \% \qquad (1)
\]



where W is the total number of words in REF, and Ws, Wi, 
Wd are the number of reference words which were substi- 
tuted, inserted, and deleted in HYP, respectively. 

For example, the WA of the recognized string in (2) is 66.7%,
since the spoken word "I" was deleted and the spoken word
"Berlin" was substituted by "Bonn" in HYP, such that Wd = 1
and Ws = 1. Inserting these values into formula (1) yields
WA = 100 (1 - 2/6) % = 66.7%.



(2)  REF:  I  want to go to Berlin
     HYP:     want to go to Bonn
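The calculation in example (2) can be reproduced with a short script. The following Python sketch is not one of the standardized scoring tools mentioned in the introduction; it is a minimal illustration which assumes whitespace-tokenized, case-sensitive word lists. The function names edit_counts and word_accuracy are chosen here for illustration only.

```python
def edit_counts(ref, hyp):
    """Count (substitutions, insertions, deletions) in one minimum-cost
    alignment of hyp against ref, each error type costing 1."""
    n, m = len(ref), len(hyp)
    # table[i][j] = (total, subs, ins, dels) for ref[:i] versus hyp[:j]
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        table[i][0] = (i, 0, 0, i)      # only deletions
    for j in range(1, m + 1):
        table[0][j] = (j, 0, j, 0)      # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, a, d = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, a, d)              # match, no cost
            else:
                diag = (t + 1, s + 1, a, d)      # substitution
            t, s, a, d = table[i - 1][j]
            dele = (t + 1, s, a, d + 1)          # reference word missing
            t, s, a, d = table[i][j - 1]
            ins = (t + 1, s, a + 1, d)           # extra hypothesis word
            table[i][j] = min(diag, dele, ins)   # compares total cost first
    return table[n][m][1:]


def word_accuracy(ref, hyp):
    """WA in percent, following formula (1)."""
    s, i, d = edit_counts(ref, hyp)
    return 100.0 * (1.0 - (s + i + d) / len(ref))


ref = "I want to go to Berlin".split()
hyp = "want to go to Bonn".split()
print(edit_counts(ref, hyp))              # (1, 0, 1)
print(round(word_accuracy(ref, hyp), 1))  # 66.7
```

Because insertions are counted as errors but do not enlarge W, WA computed this way can become negative for very poor hypotheses.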



2.2. Concept Accuracy 

While WA evaluates the performance of the speech recognition
component, the language understanding capabilities of a
system can be judged by concept accuracy (CA).³ This approach
is based on the assumption that the main task of the
linguistic processor in a spoken dialogue system is to
extract the propositional content from the spoken utterance.
Furthermore, it is assumed that this propositional content
can be represented as a list of semantic units (SU) taking
the form of attribute-value pairs. The attributes relevant
for understanding are determined by domain-dependent task
parameters which reflect the functionality of the system. For
example, in a train time table information task the system
cannot access the connected database system without knowing
the values for the task parameters sourcecity, goalcity and
date. Accordingly, the propositional content of a sentence
like (3) is represented as the series of SUs shown in (4).

(3) I want to go from Bonn to Berlin. 

(4) [sourcecity : Bonn, goalcity : Berlin] 



3 In [10] a similar measure was called information content.

Given such semantic reference answers in the form of task
parameter-value pairs, the performance of a speech
understanding component can be measured in analogy with the
method used for word recognition evaluation. Concept accuracy
CA can be calculated by replacing the words W in formula (1)
with semantic units SU:



\[
  \mathrm{CA} = 100 \left( 1 - \frac{SU_s + SU_i + SU_d}{SU} \right) \% \qquad (5)
\]



SU is the total number of semantic units in the reference 
answer and SUs, SUi, and SUd are the number of seman- 
tic units that were substituted, inserted, and deleted in the 
parser output, respectively. The calculation of CA will be 
illustrated in the following example: 



(6)  Spoken:  No to Bonn
     REF:     dm_marker : no    goalcity : Bonn
     Recog.:  No to Berlin
     HYP:     dm_marker : no    goalcity : Berlin



The total number of uttered semantic units in (6) is SU = 2.
Due to the misrecognition of the spoken word "Bonn", the
correct semantic unit goalcity : Bonn was replaced by
goalcity : Berlin in the parser output, so that SUs = 1.
This yields a concept accuracy of
CA = 100 (1 - 1/2) % = 50%.
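The same computation can be sketched in code by running the Levenshtein alignment over semantic units instead of words. The Python sketch below assumes that each semantic unit is represented as an (attribute, value) tuple and that two units match only if both parts are identical; the function names and the attribute spelling dm_marker are assumptions made for this illustration.

```python
def levenshtein(ref, hyp):
    """Minimum number of substitutions, insertions and deletions
    (unit cost each) needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j - 1] + (r != h),  # match or substitution
                            prev[j] + 1,             # deletion
                            curr[j - 1] + 1))        # insertion
        prev = curr
    return prev[-1]


def concept_accuracy(ref_units, hyp_units):
    """CA in percent, following formula (5)."""
    errors = levenshtein(ref_units, hyp_units)
    return 100.0 * (1.0 - errors / len(ref_units))


# Semantic units of example (6) as (attribute, value) pairs.
ref = [("dm_marker", "no"), ("goalcity", "Bonn")]
hyp = [("dm_marker", "no"), ("goalcity", "Berlin")]
print(concept_accuracy(ref, hyp))   # 50.0
```

Under this representation a unit with the correct attribute but a wrong value counts as one substitution, which corresponds to the treatment of goalcity in the example.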

The example shows that, besides its ability to judge the
parser performance on a semantic level, CA is also an
adequate measure for evaluating robust parsers which allow
partial analysis. This is a distinguishing feature of CA in
comparison with binary measures like sentence recognition
accuracy. In such approaches a system output either totally
agrees with a reference answer or it is counted as a total
failure. Concept accuracy, on the other hand, is able to
measure the degree of system understanding. In the above
example, 50% CA expresses the fact that the chain comprising
word recognizer and parser was able to extract half of the
propositional content from the input utterance.

2.3. Word Accuracy vs. Concept Accuracy

The example shown in the previous section illustrates that
the relationship between WA and CA cannot be predicted
systematically. Both measures can differ considerably because
WA does not distinguish between filler words and semantically
relevant words. For example, WA in (6) is 75% (only 1
substitution error), whereas CA is only 50%. This is
explained by the fact that the substituted city name forms
the semantic core of the goalcity concept, which is
consequently misunderstood as a whole. The opposite case is
illustrated by example (7), where WA = 66.7% but CA = 100%
because the misrecognitions did not concern the parts
relevant for understanding.



(7)  Spoken:  I want to go to Berlin
     REF:     goalcity : Berlin
     Recog.:  I wonder go to Berlin
     HYP:     goalcity : Berlin



The example shows that it is possible to achieve perfect
utterance understanding with less than perfect word
recognition. This happens when misrecognitions only affect
filler words that are semantically irrelevant in our domain.
On the other hand, if recognition errors occur within parts
that are relevant for understanding an utterance, CA may
become lower than WA. This relationship between WA and CA was
investigated in the experiments we describe in section 4.
These experiments were performed with the evaluation
environment and the data described in the next section.







3. EVALUATION ENVIRONMENT 

We implemented a test environment which can automatically
calculate the concept accuracy of the parsing results for a
given semantically annotated test corpus. The architecture of
our automatic evaluation system is outlined in Figure 1.

The test corpus consists of a set of test cases, which are
either transliterations of the spoken utterances or word
recognition results. In the first case the environment is
used for evaluating the linguistic component alone; in the
latter case the word recognizer (FEP) and the linguistic
processing component (LP) are evaluated together. In both
cases each test sentence is combined with a semantic
reference annotation in the form of the attribute-value pairs
shown above. The test cases are handed over sequentially to
the parser, which tries to analyze them with respect to its
knowledge base, i.e. the grammar. At the moment we use a
robust chart parser which selects a set of partial results
from the chart if no complete analysis can be found. This
parser uses a highly lexicalized unification grammar based on
the UCG formalism. The strict modularity of the evaluation
environment allows an easy replacement of the test data as
well as of the linguistic processing component. Thus,
although we use the evaluation programme mainly for progress
evaluation, it can also be used for comparative evaluation of
alternative implementations of the linguistic component. The
only requirement is that the components generate comparable
results in the semantic interface language (Sil) used in our
dialogue system. In order to compare these complex parsing
result structures with the much simpler reference
annotations, we implemented a (domain-specific) module
sil2ref which maps between Sil and the annotated semantic
units. Finally, the parsing results and the semantic
annotations are compared by the programme eval_seg, which
calculates the Levenshtein distance between them. The
resulting concept accuracy is reported (cf. Figure 1).
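The flow of test cases through such an environment can be sketched as follows. This Python sketch is an illustration under several assumptions and does not reproduce the actual components: toy_parser is a hypothetical pattern-based stand-in for the chart parser, it emits semantic units directly (so no counterpart of sil2ref is needed), and CA is pooled over the whole corpus, which is one possible way of aggregating the per-utterance counts.

```python
import re


def levenshtein(ref, hyp):
    """Unit-cost edit distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j - 1] + (r != h),  # match or substitution
                            prev[j] + 1,             # deletion
                            curr[j - 1] + 1))        # insertion
        prev = curr
    return prev[-1]


def toy_parser(words):
    """Hypothetical stand-in for the linguistic processor: picks up
    'from <City>' and 'to <City>' patterns and ignores everything else."""
    units = []
    text = " ".join(words)
    for prep, city in re.findall(r"\b(from|to) ([A-Z][a-z]+)", text):
        attribute = "sourcecity" if prep == "from" else "goalcity"
        units.append((attribute, city))
    return units


def evaluate(corpus, parser):
    """Pooled CA in percent over (input_words, reference_units) test cases."""
    errors = sum(levenshtein(ref, parser(words)) for words, ref in corpus)
    total = sum(len(ref) for _, ref in corpus)
    return 100.0 * (1.0 - errors / total)


corpus = [
    # recognizer output of example (7) with its reference annotation
    ("I wonder go to Berlin".split(), [("goalcity", "Berlin")]),
    # recognizer output of example (2); the spoken sentence is that of (7)
    ("want to go to Bonn".split(), [("goalcity", "Berlin")]),
    # transliteration of sentence (3) with its reference annotation (4)
    ("I want to go from Bonn to Berlin".split(),
     [("sourcecity", "Bonn"), ("goalcity", "Berlin")]),
]
print(evaluate(corpus, toy_parser))   # 75.0
```

Passing transliterations as input corresponds to evaluating the linguistic component alone, while passing recognizer output evaluates recognizer and parser together; exchanging the parser argument corresponds to the comparative evaluation of alternative linguistic components.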

4. EXPERIMENTS & RESULTS 

In our evaluation experiments we wanted to examine the
relation between WA and CA, in order to see whether an
improvement of the word recognizer (and thus of WA) also
leads to an improvement of concept accuracy. To this end,
several evaluation tests were run. Based on the same speech
material, we ran the recognizer with different parameter
settings, resulting in differences in word accuracy (and
processing speed). The resulting word chains were processed
by the linguistic processor, and the corresponding figures
for WA and CA were calculated.

Figure 1: Architecture of the automatic evaluation system.

Evaluation was performed on a test corpus collected while the
system was accessible via the public telephone network. 1092
dialogues with (naive) users were recorded. We recorded the
word recognizer output; the transliteration and the semantic
annotation of each utterance were done manually. Table 1
gives an overview of the test corpus.



Total number of dialogues              1092
Total number of utterances            10114
Total number of words                 33477
Total number of semantic units        14584
Different classes of semantic units      38



Table 1: Figures of the test corpus. 



The first step was to evaluate the linguistic component of
the system on its own, in order to measure the (semantic)
coverage of the grammar. The resulting figure for CA reflects
the grammar's ability to extract the meaning of an utterance
and thus its adequacy for the given domain. For this purpose,
CA was computed using the transliterations as input to the
parser and comparing the resulting semantic representation
with the reference annotation. We achieved a linguistic
coverage, i.e. a CA on transliterated input, of 92.8% for
spontaneous speech.

In order to examine the influence of different recognizer
parameters on the system's concept accuracy, several
experiments were carried out. The recognizer parameter that
was varied was the beam width. For each parameter setting the
recognizer was run on the 10114 recorded utterances of the
corpus. Concept accuracy was then measured using the
resulting recognizer output as input to the parser. Table 2
shows the resulting values for WA and the corresponding CA.



WA (%)   48.8   65.7   72.9   77.5   83.0   84.9
CA (%)   46.7   61.9   68.2   73.0   78.5   79.8



Table 2: Resulting marks for WA and CA when altering 
the recognizer beam width. 

Table 2 shows that the values for WA and CA correspond
closely. This means that in our case misrecognitions in the
acoustic front end processor affect content words and filler
words to the same extent. Moreover, we can see that the
linguistic processor does not suffer from the misrecognition
of a few words. The parser has to be judged as extremely
robust against recognition errors as well as against
phenomena of spontaneous speech. Figure 2 shows the nearly
linear relation between word accuracy and the corresponding
concept accuracy.

Figure 2: Relationship between word accuracy and concept
accuracy.

In our case we can make the assumption that word accuracy 
is a suitable indicator for concept accuracy in a spoken di- 
alogue system: recognizer and parser are well matched for 
their tasks and cooperate smoothly. 
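The near-linearity of the relation can be checked directly on the six value pairs of Table 2. The following Python sketch fits an ordinary least-squares line to these points and computes the Pearson correlation coefficient; this fit is an illustration added here and is not part of the evaluation environment described above.

```python
from math import sqrt

wa = [48.8, 65.7, 72.9, 77.5, 83.0, 84.9]   # Table 2, WA in %
ca = [46.7, 61.9, 68.2, 73.0, 78.5, 79.8]   # Table 2, CA in %

n = len(wa)
mean_wa = sum(wa) / n
mean_ca = sum(ca) / n
sxx = sum((x - mean_wa) ** 2 for x in wa)
syy = sum((y - mean_ca) ** 2 for y in ca)
sxy = sum((x - mean_wa) * (y - mean_ca) for x, y in zip(wa, ca))

slope = sxy / sxx                       # CA points gained per WA point
intercept = mean_ca - slope * mean_wa
r = sxy / sqrt(sxx * syy)               # Pearson correlation coefficient

print(f"CA = {slope:.2f} * WA + {intercept:.2f}, r = {r:.4f}")
```

On these six points the fitted slope is roughly 0.92 and the correlation coefficient lies above 0.99, which is consistent with the nearly linear relation shown in Figure 2. With only six measurements from a single corpus, the fit should be read as a description of this experiment rather than as a general law.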

5. SUMMARY 

In this paper we have shown an approach for the auto- 
mated evaluation of an understanding module for sponta- 
neous speech. This module consists of an acoustic recognizer 
and a linguistic processor. The resulting semantic content 
of each utterance is compared automatically with reference 
annotations, mimicking the evaluation of a word recognizer 
alone. Accordingly, the measure for a speech understanding 
system is called concept accuracy. 

With our evaluation setup we are able to document
improvements in one of our modules in an automated way. Thus,
we are able to optimize not only isolated modules but also
the whole understanding system. Experiments show that our
parser is robust in the sense that we observe a nearly linear
relation between WA and CA.

Further work will be devoted to adjusting parser parameters.
Eventually we hope to increase CA beyond WA.